11 research outputs found

    Vowel non-vowel based spectral warping and time scale modification for improvement in children’s ASR

    No full text
    Acoustic differences between children’s and adults’ speech degrade the performance of automatic speech recognition systems when a system is trained on adults’ speech and tested on children’s speech. The key acoustic mismatch factors are formant frequencies, speaking rate, and pitch. In this paper, we propose a linear prediction based spectral warping method that uses knowledge of the vowel and non-vowel regions in speech signals to mitigate the differences in formant frequencies between child and adult speakers. The proposed method gives a 31% relative improvement over the baseline system. We have also investigated time scale modification using the RTISILA and SOLAFS algorithms and found that our proposed method performs better. Combining the proposed method with RTISILA and SOLAFS results in a further error rate reduction. The final combined system gives a 49% relative improvement compared to the baseline system. Peer reviewed

    A Formant Modification Method for Improved ASR of Children’s Speech

    No full text
    Differences in acoustic characteristics between children’s and adults’ speech degrade performance of automatic speech recognition systems when systems trained using adults’ speech are used to recognize children’s speech. This performance degradation is due to the acoustic mismatch between training and testing. One of the main sources of the acoustic mismatch is the difference in vocal tract resonances (formant frequencies) between adult and child speakers. The present study aims to reduce the mismatch in formant frequencies by modifying formants of children’s speech to better correspond to formants of adults’ speech. This is carried out by warping the linear prediction (LP) spectrum computed from children’s speech. The warped LP spectra computed in a frame-based manner from children’s speech are used with the corresponding LP residuals to synthesize speech whose formant structure is more similar to that of adults’ speech. When used in testing of an ASR system trained using adults’ speech, the warping reduces the spectral mismatch in speech between training and testing and improves the system performance in recognition of children’s speech. Experiments were conducted using narrowband (8 kHz) and wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and PFSTAR databases, respectively, and by recognising children’s speech using acoustic models trained with adults’ speech. The proposed method gave relative improvements of 24% and 11% for the DNN and TDNN acoustic models, respectively, for narrowband speech. For wideband speech, the technique gave relative improvements of 27% and 13% for the DNN and TDNN acoustic models, respectively. The performance of the proposed method was also compared to two speaker adaptation methods: vocal tract length normalization (VTLN) and speaking rate adaptation (SRA). This comparison showed the best recognition performance for the proposed method.
We also combined the proposed method with VTLN and SRA, and found that the combined method gave a further reduction in WER. Moreover, our experiments carried out for noisy speech using various types of additive noise and signal-to-noise ratios showed that the proposed method performs well also for degraded speech. Peer reviewed
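
The frame-based procedure described in the abstract above (LP analysis, warping of the LP spectrum, resynthesis from the LP residual) can be illustrated with a minimal sketch. It is not the authors' implementation: the warping here is a simple scaling of the LP pole angles by a hypothetical factor `alpha`, chosen only because it is easy to verify, and the function names, the LP order, and the absence of windowing and overlap-add are all assumptions.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis; returns [1, a1, ..., ap] for A(z)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + (1e-9 * r[0] + 1e-12) * np.eye(order)  # small ridge for numerical safety
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))

def warp_lp(a, alpha):
    """Scale the angles of the complex LP poles by alpha (assumed warping rule).

    alpha < 1 moves resonances to lower frequencies; real poles are left as-is.
    """
    roots = np.roots(a)
    is_complex = np.abs(roots.imag) > 1e-8
    angles = np.angle(roots)
    new_angles = np.where(is_complex, np.clip(angles * alpha, -np.pi, np.pi), angles)
    warped = np.abs(roots) * np.exp(1j * new_angles)
    return np.real(np.poly(warped))

def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z) with zero initial state."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def warp_frame(frame, order=12, alpha=0.9):
    """LP analysis, residual extraction, pole warping, and resynthesis for one frame."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    residual = np.convolve(frame, a)[:len(frame)]  # inverse filtering with A(z)
    return all_pole_filter(residual, warp_lp(a, alpha))
```

With `alpha = 1.0` the sketch reconstructs the input frame up to numerical error, which is a convenient sanity check. A practical system would additionally window the frames and overlap-add the outputs, details the abstract does not specify.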

    Speaker Verification Experiments for Adults and Children Using Shared Embedding Spaces

    No full text
    | openaire: EC/H2020/780069/EU//MeMAD For children, the system trained on a large corpus of adult speakers performed worse than a system trained on a much smaller corpus of children’s speech. This is due to the acoustic mismatch between training and testing data. To capture more acoustic variability we trained a shared system with mixed data from adults and children. The shared system yields the best EER for children with no degradation for adults. Thus, the single system trained with mixed data is applicable for speaker verification for both adults and children. Peer reviewed

    Spectral modification for recognition of children’s speech under mismatched conditions

    No full text
    In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children’s speech by automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, the ASR performance is degraded due to the acoustic and linguistic mismatch in the attributes between child and adult speakers. The proposed method is used to improve speech intelligibility and thereby enhance the recognition of children’s speech using an acoustic model trained on adult speech. In the experiments, WSJCAM0 and PFSTAR are used as databases for adults’ and children’s speech, respectively. The proposed technique gives a significant improvement in the context of the DNN-HMM-based ASR. Furthermore, we validate the robustness of the technique by showing that it performs well also in mismatched noise conditions. Peer reviewed

    Using data augmentation and time-scale modification to improve ASR of children’s speech in noisy environments

    No full text
    Current ASR systems show poor performance in recognition of children’s speech in noisy environments because recognizers are typically trained with clean adults’ speech and therefore there are two mismatches between training and testing phases (i.e., clean speech in training vs. noisy speech in testing and adult speech in training vs. child speech in testing). This article studies methods to tackle the effects of these two mismatches in recognition of noisy children’s speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children’s speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speakers. A standard ASR system based on DNN–HMM was built and the effects of data augmentation, F0 modification, and speaking rate modification on word error rate (WER) were evaluated first separately and then by combining all three techniques. The experiments were conducted using children’s speech corrupted with additive noise of four different noise types in four different signal-to-noise ratio (SNR) categories. The results show that the combination of all three techniques yielded the best ASR performance. As an example, the WER value averaged over all four noise types in the SNR category of 5 dB dropped from 32.30% to 12.09% when the baseline system, in which no data augmentation or time-scale modification was used, was replaced with a recognizer that was built using a combination of all three techniques.
In summary, in recognizing noisy children’s speech with ASR systems trained with clean adult speech, considerable improvements in the recognition performance can be achieved by combining data augmentation based on noise addition in the system training phase and time-scale modification based on modifying F0 and speaking rate of children’s speech in the testing phase. Peer reviewed
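
The noise-addition step of the data augmentation described above can be sketched as follows. This is a generic way of mixing a noise signal into clean speech at a target SNR, not the procedure from the article: the function name `mix_at_snr`, the tiling of short noise clips, and the use of mean power over the whole utterance are assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so that the result has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Choose the gain g so that p_clean / (g**2 * p_noise) == 10 ** (snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Calling `mix_at_snr(utterance, noise_clip, 5)` would produce a training utterance in a 5 dB SNR condition such as the one quoted in the abstract. Which noise recordings and SNR values to use are choices the abstract only summarizes.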

    Synthesis Speech Based Data Augmentation for Low Resource Children ASR

    No full text
    Publisher Copyright: © 2021, Springer Nature Switzerland AG. Successful speech recognition for children requires large training data with sufficient speaker variability. The collection of such a training database of children’s voices is challenging and very expensive for a zero/low resource language like Punjabi. In this paper, the data scarcity issue of the low resource language Punjabi is addressed through two levels of augmentation. The original training corpus is first augmented by modifying the prosody parameters for pitch and speaking rate. Our results show that the augmentation improves the system performance over the baseline system. Then the augmented data are combined with the original data and used to train a TTS system that generates synthetic data, and the extended dataset is further augmented by generating children’s utterances using text-to-speech synthesis and sampling the language model with methods that increase the acoustic and lexical diversity. The final speech recognition performance indicates relative improvements of 50.10% with acoustic and 57.40% with language diversity based augmentation, respectively, in comparison to the baseline system. Peer reviewed

    Study of Formant Modification for Children ASR

    No full text
    The performance of automatic speech recognition systems for children’s speech is known to suffer from the large variation and mismatch in the acoustic and linguistic attributes between children’s and adults’ speech. One of the various identified sources of mismatch is the difference in formant frequencies between adults and children. In this paper, we propose a formant modification method to mitigate differences between adults’ and children’s speech and to improve the performance of ASR for children. The explored technique gives a relative 27% improvement in system performance compared to a hybrid DNN-HMM baseline. We also compare the system performance with related speaker adaptation methods like vocal tract length normalization (VTLN) and speaking rate adaptation (SRA) and find that the proposed method gives improvements over them, as well. Combining the proposed method with VTLN and SRA results in a further reduction of WER. We also found that the proposed method performs well even for noisy speech. Peer reviewed

    Data augmentation using prosody and false starts to recognize non-native children's speech

    No full text
    This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech being spontaneous has false starts transcribed as partial words, which in the test transcriptions leads to unseen partial words. To cope with these two challenges, we investigate a data augmentation-based approach. Firstly, we apply the prosody-based data augmentation to supplement the audio data. Secondly, we simulate false starts by introducing partial-word noise in the language modeling corpora creating new words. Acoustic models trained on prosody-based augmented data outperform the models using the baseline recipe or the SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, is placed third in the evaluation period and achieves the word error rate of 18.71%. Post-evaluation period, we observe that increasing the amounts of prosody-based augmented data leads to better performance. Furthermore, removing low-confidence-score words from hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%. Peer reviewed
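
Several of the abstracts in this listing report relative improvements, while the one above reports absolute word error rates. The usual conversion between the two is sketched below; the function name is hypothetical and the formula is the standard definition of relative reduction, which the abstracts themselves do not spell out.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Absolute WERs quoted in the abstract above (evaluation vs. post-evaluation).
print(round(relative_wer_reduction(18.71, 17.99), 2))
```

Under this definition, the drop from 18.71% to 17.99% is a relative reduction of a little under 4%.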

    Data Augmentation Using Spectral Warping for Low Resource Children ASR

    No full text
    Funding Information: This work was supported by the Academy of Finland (grants 329267, 330139). Publisher Copyright: © 2022, The Author(s). In low resource children automatic speech recognition (ASR), the performance is degraded due to limited acoustic and speaker variability available in small datasets. In this paper, we propose a spectral warping based data augmentation method to capture more acoustic and speaker variability. This is carried out by warping the linear prediction (LP) spectra computed from speech data. The warped LP spectra computed in a frame-based manner are used with the corresponding LP residuals to synthesize speech to capture more variability. The proposed augmentation method is shown to improve the ASR system performance over the baseline system. We have compared the proposed method with four well-known data augmentation methods: pitch scaling, speaking rate, SpecAug and vocal tract length perturbation (VTLP), and found that the proposed method performs the best. Further, we have combined the proposed method with these existing data augmentation methods to improve the ASR system performance even more. The combined system consisting of the original data, VTLP, SpecAug and the proposed spectral warping method gave the best performance by a relative word error rate reduction of 32.13% and 10.51% over the baseline system for Punjabi children and TLT-school corpus, respectively. The proposed spectral warping method is publicly available at https://github.com/kathania/Spectral-Warping. Peer reviewed